Validity and reliability of “My Jump app” to assess vertical jump performance: a meta-analytic review

This systematic review and meta-analysis aims to investigate the validity and reliability of the My Jump smartphone application in measuring vertical jump height, specifically using flight-time-based measures. To identify potential studies for inclusion, a comprehensive search strategy was employed in the PubMed, Web of Science, Scopus, and EBSCOhost databases. Validity was assessed in two ways: (1) means and standard deviations of My Jump measurements were compared to criterion methods to assess the agreement of raw scores; (2) correlation coefficients evaluated the within-group consistency of rankings between My Jump and criterion methods. Reliability was assessed using intraclass correlation coefficients (ICC). Heterogeneity was evaluated via Cochran’s Q statistic, its p-value, the I² value, and the tau² value. Publication bias was explored through funnel plot symmetry and confirmed with the extended Egger’s test. Following the search, 21 studies met the inclusion criteria. Results showed no significant difference in raw scores between My Jump and criterion methods, indicating high agreement. High correlation was also found for within-group rankings, suggesting consistency. The My Jump application demonstrated nearly perfect reliability scores. The My Jump application appears to be a valid and reliable tool for sports scientists and strength and conditioning practitioners, offering a cost-effective and accessible means of accurately assessing vertical jump performance in various settings. However, these results are specific to flight-time-based measures, and further research is needed to validate these findings against gold-standard take-off velocity methods.

The emergence of novel devices (e.g. My Jump smartphone application, GymAware, PUSH Band) measuring athletic performance is quickly gaining momentum as these devices increase in popularity as potential alternatives to expensive laboratory equipment 1,2. Their main advantage is portability (especially in the case of software applications integrated into tablets and smartphones); they have the potential to solve many problems of laboratory-based measurement methods, such as the high cost of laboratory equipment, the difficulty of transporting devices to the field or people to the laboratory, and the need for periodic maintenance and complex interfaces [2][3][4]. However, to take advantage of these benefits, it is necessary to ensure that measurements made with these methods give valid and reliable outputs.
Validity of the measurement of athletic characteristics requires that the movement pattern is close to the mode and profile observed during competition, that there is an output that represents specific proficiency, that the measurements are associated with a proven gold standard or a criterion measurement, that the evaluations can predict actual competition performance, and that the results can distinguish between successful and unsuccessful athletes [5][6][7]. Similarly, reliability of the measurement of athletic characteristics requires that the measurements can be replicated, that the within-group ranking and agreement between the raw scores can be maintained from test to retest, and that successive measurements give the same output when performed within such a short time that no actual performance improvement is possible [8][9][10]. Hence, a novel device or method designed to measure a component of physical fitness is expected to meet all these requirements regarding validity and reliability.

Inclusion criteria
The inclusion criteria for the articles in the systematic review and meta-analysis focused on the types of studies, testing methods, participants, variables, statistical analysis, and reported outputs. To be included, the studies had to: (1) be original research; (2) be published in a peer-reviewed scientific journal; (3) have the full text available in English; (4) investigate vertical jump performance; (5) have included human participants such as athletes, untrained individuals, adults, elderly, children, etc.; (6) have investigated validity or reliability scores of the My Jump app; (7) report the Pearson r correlation, regression, and intraclass correlation coefficients (ICCs), or means and standard deviations.

Exclusion criteria
The following types of studies were excluded from the present meta-analysis: (1) studies in a language other than English; (2) unpublished studies, reviews, book chapters, editorials, non-peer-reviewed texts, case studies, abstracts, theses; (3) studies not reporting any validity or reliability statistics; (4) studies that focused on animal experiments; (5) studies using an application other than My Jump; (6) studies examining a physical fitness characteristic other than the vertical jump; (7) studies examining the effect of an exercise intervention on vertical jump performance.

Data extraction
The extracted data included the authors, year of publication, sample subject characteristics (age, body mass, height), criterion, type of vertical jump, means and standard deviations of the My Jump and criterion measurements, and the validity and reliability outputs (Table 1). Three authors (CG, MT and SÖ) independently extracted the data from the selected articles using a pre-defined form created in Microsoft Excel (Microsoft Corporation, Redmond, WA, USA). If there were any disagreements between the authors about the extracted data, the accuracy of the information was re-checked to reach a consensus.

Methodological quality assessment
The methodological quality of each included study was assessed using a modified Downs and Black assessment scale 33. A total of 8 domains were identified to evaluate the quality of reporting for studies included in this review: (1) whether the hypothesis/aim was described; (2) whether the participants were representative of the target population; (3) whether the participant characteristics were detailed; (4) whether the intervention procedure was detailed; (5) whether an appropriate reference test/criterion was used; (6) whether appropriate statistical tests were used; (7) whether the main outcomes were reported; (8) whether the outcome measures were valid and reliable. Each criterion was evaluated as low quality, moderate quality, high quality, inadequate, or unclear.

Meta-analyses
Meta-analyses were conducted using Comprehensive Meta-Analysis software, version 2 for Windows (CMA, Biostat company, Englewood, NJ, USA) 34. The meta-analysis of validity was performed in two ways: (1) the means and standard deviations were compared between the My Jump and criterion measurements to assess the agreement of raw scores; (2) the correlation coefficients were used to determine the within-group consistency of the rankings in the My Jump and criterion measurements. Additionally, a meta-analysis of reported ICCs was performed to confirm reliability. However, the types of ICCs or Pearson r coefficients reported in the studies varied (inter-rater, intra-rater, within-subject, between the devices, between two test days, between the consecutive jump performances of the same participant, etc.). Therefore, when a study reported multiple Pearson r coefficients or ICCs from the same sample group, the study was used as a single unit of analysis to avoid overestimating its contribution to the pooled result through double counting. The pooled correlation values were interpreted according to a random-effects model when there was heterogeneity between studies (i.e. when the p-value of the Q statistic was less than 0.1) 35. For validity, pooled correlations were classified as follows: 0-0.19, "no significant correlation"; 0.2-0.39, "low correlation"; 0.4-0.59, "moderate correlation"; 0.6-0.79, "moderately high correlation"; and ≥ 0.8, "high correlation" 36,37. For reliability, the strength thresholds of Landis and Koch were applied as follows: 0.01-0.20, "mild reliability"; 0.21-0.40, "fair reliability"; 0.41-0.60, "moderate reliability"; 0.61-0.80, "substantial reliability"; and 0.81-1.00, "nearly perfect reliability" 38. Sub-analyses were conducted to examine factors that might affect the outcomes. They were categorized by three parameters: the type of jump performed (CMJ, SQJ, and DJ), the criterion device used for measurement (force plates and other devices), and the type of reliability assessed (inter-rater, intra-rater, inter-session, and within-session reliability).
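The two interpretation scales quoted above can be written as simple lookup functions. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the handling of values that fall between the printed bands (for example 0.195) is an assumption, since the text does not specify it.

```python
def classify_validity(r: float) -> str:
    """Label a pooled correlation using the validity bands quoted in the text.

    Assumption: each band's lower bound is inclusive, so values between
    printed bands (e.g. 0.195) fall into the lower category.
    """
    r = abs(r)
    if r >= 0.8:
        return "high correlation"
    if r >= 0.6:
        return "moderately high correlation"
    if r >= 0.4:
        return "moderate correlation"
    if r >= 0.2:
        return "low correlation"
    return "no significant correlation"


def classify_reliability(icc: float) -> str:
    """Label a pooled ICC using the Landis and Koch bands quoted in the text.

    Assumption: each band's upper bound is inclusive; values below 0.01 are
    outside the quoted scale and are reported as such.
    """
    if icc < 0.01:
        return "below the quoted scale"
    if icc <= 0.20:
        return "mild reliability"
    if icc <= 0.40:
        return "fair reliability"
    if icc <= 0.60:
        return "moderate reliability"
    if icc <= 0.80:
        return "substantial reliability"
    return "nearly perfect reliability"
```

Applied to the pooled values reported later in the paper, a correlation of 0.989 falls in the "high correlation" band and an ICC of 0.986 in the "nearly perfect reliability" band.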
Heterogeneity was determined by Cochran's Q statistic and its p-value, the I-squared value, and the tau-squared value 35,39,40. The Q-value (and its p-value), which indicates whether all the studies share a common effect size, and the I-squared value, which refers to the proportion of the observed variance that remains when the sampling error is eliminated (i.e. when observing the true effect size for all studies in the analysis), are the most common heterogeneity indicators 37,39,41,42. I-squared values of < 25%, 25-75%, and > 75% were considered to represent low, moderate, and high levels of heterogeneity, respectively 43. The tau-squared value is a measure of the variance of true effects, representing a concrete and reliable indicator of heterogeneity 31,35,41.
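The three heterogeneity indicators can be computed from per-study effect sizes and their sampling variances. The sketch below assumes the DerSimonian-Laird (method-of-moments) estimator for tau-squared; the paper does not state which estimator the software used, and the function name and example numbers are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return (Q, I2 in percent, tau2) for a set of study effect sizes.

    effects   -- per-study effect sizes (e.g. Hedges' g or Fisher's z)
    variances -- per-study sampling variances of those effect sizes
    Assumption: tau2 is estimated with the DerSimonian-Laird method.
    """
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(weights)
    pooled_fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1

    # I2: share of observed variance beyond what sampling error explains
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # tau2: between-study variance of true effects (truncated at zero)
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, i2, tau2


# Hypothetical example: three studies with equal sampling variance
q, i2, tau2 = heterogeneity([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
```

The p-value of Q would be obtained from a chi-squared distribution with k − 1 degrees of freedom, which is omitted here to keep the sketch free of external libraries.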
The risk of publication bias was explored using funnel plot symmetry, and asymmetries were confirmed using the extended Egger's test 44 (based on the standard errors of the correlation coefficients). A significant coefficient for Egger's test means that the effect sizes and sampling variance for each study are related and indicates that publication bias is present. Where there was evidence of publication bias, Duval and Tweedie's "trim and fill" procedure was applied to determine whether estimates required adjustment based on missing studies 37. Additionally, sensitivity analyses were conducted by removing a study 14 with validity concerns to assess the robustness of the pooled estimates.
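To illustrate the idea behind Egger's test, the sketch below implements the classic regression form: each study's standardized effect (effect divided by its standard error) is regressed on its precision (one over the standard error), and an intercept far from zero signals funnel plot asymmetry. This is an assumption-laden illustration of the classic test, not the "extended" variant or the software routine used in the paper; the function name and data are hypothetical.

```python
import math


def egger_regression(effects, std_errors):
    """Classic Egger regression: (effect / SE) on (1 / SE).

    Returns (intercept, se_intercept, slope). The intercept divided by its
    standard error is compared with a t distribution on n - 2 degrees of
    freedom; that final step is left out of this sketch.
    """
    x = [1.0 / se for se in std_errors]                   # precision
    y = [e / se for e, se in zip(effects, std_errors)]    # standardized effect
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Ordinary least-squares standard error of the intercept
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = residual_ss / (n - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, se_intercept, slope


# Hypothetical data in which smaller studies (larger SE) report larger effects
ses = [0.1, 0.2, 0.25, 0.5]
effs = [0.5 + 2.0 * se for se in ses]
intercept, se_intercept, slope = egger_regression(effs, ses)
```

In this constructed example the effects grow with the standard error, so the regression recovers a non-zero intercept, which is the pattern the test is designed to detect.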

Study selection
We initially found a total of 74 potential research articles related to the My Jump smartphone application published until September 2023. After excluding 44 duplicates and 4 studies based on their titles and abstracts, 26 studies were reviewed as full texts. Following the identification of studies meeting the inclusion criteria of this paper, a total of 21 studies comprising 839 participants in total were included in the present meta-analysis (Fig. 1).

Methodological quality
Using a modified Downs and Black assessment scale 33, eight risk domains for the 21 individual research articles (a total of 168 scores) were evaluated. Only nine items were scored as low quality and one item as inadequate, while all other items were rated as moderate or high quality; therefore, the overall methodological quality was considered moderate-to-high (Table 2).

Heterogeneity and publication bias outputs
The heterogeneity statistics and publication bias were assessed for three main categories: mean differences, reliability analysis (ICC values), and validity analysis (r values). For mean differences, the Cochran Q statistic was 46.67 (p < 0.001), with an I² value of 70.0% and a tau² of 0.062. Egger's test for publication bias was not significant (p = 0.860). For reliability analysis, the Cochran Q statistic was notably high at 512.5 (p < 0.001). The I² value was 96.5%, and tau² was 0.397. Egger's test indicated a p-value of 0.156. For validity analysis, the Cochran Q statistic was 959.2 (p < 0.001), with an I² value of 98.3% and a tau² of 0.864. Egger's test showed a p-value of 0.436. High I² values, such as those observed in the reliability and validity analyses, indicate substantial heterogeneity among the included studies. An I² value above 75% is generally considered to represent considerable heterogeneity. This suggests that the observed variations in effect sizes are not solely due to sampling error but may be attributed to other factors, such as methodological differences or population characteristics among the studies (Table 3).
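As a worked check, the reported I² for mean differences can be reproduced from Q with the usual relation I² = (Q − df)/Q. The value df = 14 is an assumption: it corresponds to 15 comparisons, inferred from the six force-plate and nine non-force-plate studies in the criterion-device sub-analysis rather than stated explicitly in the text.

```latex
I^2 = \frac{Q - df}{Q} \times 100\%
    = \frac{46.67 - 14}{46.67} \times 100\%
    \approx 70.0\%
```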

Validity outputs
The meta-analysis of raw-score agreement showed no significant difference between My Jump and the criterion measures (Hedges' g = −0.047; p = 0.21, Fig. 2). Further analyses showed significant heterogeneity (Q = 46.67; p < 0.001; tau² = 0.062), with an I² value indicating that 70.0% of the observed variance in effect sizes reflected heterogeneity between the individual studies (Table 3). Because of the significant heterogeneity, the pooled effect size was calculated according to the random-effects model. Additionally, the risk of publication bias was explored using funnel plot symmetry and confirmed using the extended Egger's test (Table 3). Egger's test did not show any potential asymmetry (p = 0.860).
The sub-analysis for CMJ included 13 studies. The fixed effect model yielded a Hedges' g value of −0.060 (95% CI −0.151 to 0.032, p = 0.202), while the random effect model indicated a Hedges' g value of −0.047 (95% CI −0.233 to 0.139, p = 0.622). The studies exhibited varying degrees of relative weight, ranging from 3.04 to 14.07%. In the sub-analysis for SQJ, three studies were included. The fixed and random effect models both showed a Hedges' g value of −0.020 (95% CI −0.257 to 0.217, p = 0.869). The studies in this category had relative weights of 31.38%, 32.83%, and 35.78%. The sub-analysis for DJ comprised four studies. Both the fixed and random effect models indicated a Hedges' g value of −0.034 (95% CI −0.295 to 0.226, p = 0.795). The relative weights for the studies in this sub-analysis ranged from 10.50 to 37.79% (Table 4).
The sub-analysis of studies utilizing force plates included six studies. The fixed and random effect models both indicated a Hedges' g value of 0.015 (95% CI −0.154 to 0.184, p = 0.863). The relative weights for the studies in this sub-analysis ranged from 4.41 to 29.95%. In the sub-analysis of studies not utilizing force plates, nine studies were included. The fixed effect model showed a Hedges' g value of −0.073 (95% CI −0.166 to 0.020, p = 0.124), while the random effect model indicated a Hedges' g value of −0.075 (95% CI −0.302 to 0.151, p = 0.515). The studies in this category had relative weights ranging from 4.69 to 16.27% (Table 5).
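For readers unfamiliar with the effect size used here, the following sketch shows how Hedges' g can be computed from the means and standard deviations of two sets of measurements. It uses the independent-groups formula with the common small-sample correction; this is an assumption, because the paper does not report whether the paired nature of the data (the same jumps measured by both devices) was modelled. The function name and input values are hypothetical.

```python
import math


def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (g, variance of g) for two sets of measurements.

    Assumption: independent-groups formula with pooled standard deviation.
    """
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df)
    d = (mean_a - mean_b) / pooled_sd             # Cohen's d
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)     # small-sample correction factor
    g = d * correction
    var_d = (n_a + n_b) / (n_a * n_b) + d ** 2 / (2.0 * (n_a + n_b))
    var_g = correction ** 2 * var_d
    return g, var_g


# Hypothetical example: app mean 30 cm, criterion mean 31 cm, SD 5 cm, n = 20 each
g, var_g = hedges_g(30.0, 5.0, 20, 31.0, 5.0, 20)
```

A g close to zero, as in the pooled results above, means the two methods differ by only a small fraction of a standard deviation.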
The meta-analysis of the within-group consistency of rankings showed a high correlation (r = 0.989) between My Jump and the criterion measures, while individual studies reported correlations ranging from 0.813 to 0.999 (Fig. 3). Further analyses showed significant heterogeneity (Q = 959.2; p < 0.001; tau² = 0.864), with an I² value indicating that 98.3% of the observed variance in effect sizes reflected heterogeneity between the individual studies (Table 3). The risk of publication bias was explored using funnel plot symmetry and confirmed using the extended Egger's test (Table 3). Egger's test did not show any potential asymmetry (p = 0.436).
Correlation sub-analyses were conducted by criterion measure (force plates versus other devices). The fixed effect model for studies using force plates showed a correlation of 0.994 (95% CI 0.992-0.995, p < 0.001), while the random effect model indicated a correlation of 0.992 (95% CI 0.968-0.998, p < 0.001). For studies not using force plates, the fixed effect model revealed a correlation of 0.981 (95% CI 0.978-0.983, p < 0.001), and the random effect model showed a correlation of 0.985 (95% CI 0.951-0.995, p < 0.001). These results indicate that My Jump correlates highly and consistently with both types of criterion measures (Table 7).
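Pooled correlations of this kind are usually obtained by transforming each correlation to Fisher's z, pooling on that scale, and transforming back. The sketch below assumes this approach with DerSimonian-Laird random-effects weights and a normal-approximation 95% confidence interval; the paper does not spell out the computation, and the function name and example data are hypothetical.

```python
import math


def pool_correlations(correlations, sample_sizes):
    """Random-effects pooled correlation via Fisher's z.

    Returns (pooled_r, (ci_low, ci_high)).
    Assumptions: var(z) = 1 / (n - 3) and DerSimonian-Laird tau2.
    """
    z = [math.atanh(r) for r in correlations]      # Fisher's z transform
    v = [1.0 / (n - 3) for n in sample_sizes]      # sampling variance of each z
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)

    # Fixed-effect mean and Q, needed to estimate the between-study variance
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum_w
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to each study's variance
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    low, high = z_re - 1.96 * se, z_re + 1.96 * se

    # Back-transform to the correlation scale
    return math.tanh(z_re), (math.tanh(low), math.tanh(high))


# Hypothetical example with three studies
pooled_r, (ci_low, ci_high) = pool_correlations([0.95, 0.98, 0.99], [20, 30, 40])
```

Pooling on the z scale matters most for correlations near 1, as in this review, where the sampling distribution of r itself is strongly skewed.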

Reliability outputs
The meta-analysis of the reliability of the My Jump smartphone application showed nearly perfect reliability scores (r = 0.986), while individual studies reported correlations ranging from 0.748 to 1.0 (Fig. 4). Further analyses showed significant heterogeneity (Q = 512.5; p < 0.001; tau² = 0.397), with an I² value indicating that 96.5% of the observed variance in effect sizes reflected heterogeneity between the individual studies (Table 3). The risk of publication bias was explored using funnel plot symmetry and confirmed using the extended Egger's test (Table 3). Egger's test did not show any potential asymmetry (p = 0.386).
The ICC was used to assess the reliability of different types of jumps, including CMJ, SQJ, and DJ. For CMJ, the fixed effect model showed an ICC of 0.969 (95% CI 0.965-0.972, p < 0.001), and the random effect model indicated an ICC of 0.982 (95% CI 0.961-0.992, p < 0.001). For SQJ, the fixed effect model revealed an ICC of 0.965 (95% CI 0.953-0.974, p < 0.001), and the random effect model showed an ICC of 0.961 (95% CI 0.889-0.987, p < 0.001). In the case of DJ, the fixed effect model indicated an ICC of 0.972 (95% CI 0.964-0.979, p < 0.001), while the random effect model showed an ICC of 0.987 (95% CI 0.940-0.997, p < 0.001). These results suggest that My Jump measurements are highly reliable across jump types, as evidenced by the consistently high ICCs across studies (Table 8).
For studies that utilized force plates for criterion measurement, the fixed effect model showed an ICC of 0.989 (95% CI 0.986-0.991, p < 0.001), and the random effect model indicated an ICC of 0.993 (95% CI 0.981-0.997, p < 0.001). For studies that did not utilize force plates for criterion measurement, the fixed effect model showed an ICC of 0.960 (95% CI 0.955-0.964, p < 0.001), and the random effect model indicated an ICC of 0.972 (95% CI 0.936-0.988, p < 0.001). These findings suggest that reliability is high whether or not force plates are used, as evidenced by the consistently high ICCs across studies (Table 9).
The sub-analysis by reliability type also revealed high levels of reliability across different contexts. For inter-rater reliability, the fixed effect model showed an ICC of 0.993 (95% CI 0.991-0.994, p < 0.001), and the random effect model indicated an ICC of 0.996 (95% CI 0.987-0.999, p < 0.001). In terms of intra-rater reliability, the fixed effect model revealed an ICC of 0.995 (95% CI 0.993-0.996, p < 0.001), and the random effect model showed an ICC of 0.997 (95% CI 0.990-0.999, p < 0.001). For inter-session reliability, both the fixed and random effect models showed an ICC of 0.970 (95% CI 0.954-0.981, p < 0.001). Lastly, within-session/device reliability had a fixed effect model ICC of 0.946 (95% CI 0.940-0.952, p < 0.001) and a random effect model ICC of 0.973 (95% CI 0.945-0.987, p < 0.001). These consistently high ICCs across studies suggest that the reliability of My Jump is robust across the different types of reliability assessed (Table 10).

Discussion
In this systematic review and meta-analysis, the validity and reliability findings of the My Jump smartphone application, which is designed to measure vertical jump performance, were summarized using meta-analytical methods. This review summarized the findings of 21 studies comprising 839 participants in total. The overall methodological quality of the individual studies included in this meta-analysis was considered moderate-to-high. Further analyses showed significant heterogeneity scores; thus, the pooled calculations were interpreted according to the random-effects model. For validity, meta-analysis results revealed that there was raw score agreement between My Jump and the criterion measures, based on nonsignificant Hedges' g values, as well as a high consistency of the within-group rankings, based on the pooled correlation result. For reliability, our meta-analysis showed near-perfect reliability for My Jump, based on the pooled ICC value. Additionally, sub-analyses suggested that the results were robust across different types of jumps, reference devices used, and types of reliability.
In fact, unlike the usual study designs that investigate the validity and test-retest reliability of athletic performance measures, validity and reliability analyses of the My Jump application can be completed without the need for re-testing 14,45,46. As the participant performs a single vertical jump on a platform accepted as the criterion (force plate, mat, photocell sensors, etc.), a video can be recorded simultaneously 12,13,47. Thus, possible biases can be attributed to factors other than the participants. For example, because take-off and landing points are manually marked, minor variations are likely when a rater measures the same vertical jump performance consecutively. Similarly, two raters may mark the take-off and landing points differently in a video recording of a single jump. However, My Jump provides a functional way to minimize these errors, as it allows the video to be paused and played frame by frame 12,13,47. Additionally, the formula it uses (h = t² × 1.22625, where h is the jump height in metres and t is the flight time in seconds) 48 is equivalent to that of most criterion devices. In this case, the main limitation seems to be the small variations that can arise from manually determining the take-off and landing points. While the My Jump application relies on flight time to calculate jump height, force platforms are often considered the gold standard due in part to their ability to calculate jump height based on the impulse-momentum theorem, which takes into account the total force applied during the jump and the duration of this force, providing a more holistic assessment 15,16. However, it is worth noting that flight-time-based calculations are also commonly used with force platforms. In fact, in the initial study introducing My Jump, a force platform was used as the reference device, and it too employed a flight-time-based methodology for the sake of comparison 47. This methodological overlap offers some advantages. For instance, strategies that could artificially lengthen flight time are also applicable to force platform measurements based on flight time 15. Therefore, in scenarios where the device rather than the method serves as the reference for jump performance, My Jump appears to be a viable alternative. The primary objective of our study is to test whether My Jump can serve as an alternative to more expensive and less portable devices. Our findings suggest that My Jump can be reliably used for practical applications such as ranking the jump performance of members within a group or tracking an athlete's jump performance over time, provided that the same methodology is consistently applied.
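The constant 1.22625 in the flight-time formula follows from projectile motion, under the assumption that the jumper's centre of mass is at the same height at take-off and landing. The rise to the apex then takes half the flight time t, so with g = 9.81 m/s² (the value implied by the constant):

```latex
h = \tfrac{1}{2}\, g \left(\tfrac{t}{2}\right)^{2} = \frac{g\,t^{2}}{8},
\qquad
g = 9.81\ \mathrm{m\,s^{-2}}
\;\Rightarrow\;
h = 1.22625\, t^{2}
```

This also shows why strategies that lengthen flight time, such as landing with bent legs, inflate the computed height: any extra time in the air is treated as additional rise and fall.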
A high sampling rate is critical for the My Jump app, which uses video recordings to calculate jump metrics 3,19,47. In the study introducing this smartphone application, an iPhone 5s was employed, featuring a 120 Hz high-speed camera at a quality of 720p, deemed adequate for such calculations 47. Moreover, newer models of the device offer even more advanced capabilities 3,49,50. When calculating jump height based on flight time, consider an average jump height of, for instance, 30 cm. The time an athlete would spend airborne for such a jump is approximately half a second. With a 120 Hz high-speed camera, this duration would translate into about 60 frames (120 frames/second × 0.5 s = 60 frames). This high frame rate can allow for an accurate and reliable calculation of flight time, thus offering a valid measurement of jump height. What appears to be crucial is the adoption of a standardized procedure for selecting the frames where the jump starts and ends, ensuring consistency and reliability in measurements 1,51,52.
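The arithmetic in this paragraph can be reproduced, and extended to show what a one-frame marking error means in centimetres. The sketch below is illustrative only: it uses the flight-time formula quoted in the text, the function names are hypothetical, and the one-frame error is an assumed scenario rather than a result reported by the paper.

```python
import math

# Constant from the formula quoted in the text (h = t^2 * 1.22625),
# i.e. g / 8 with g = 9.81 m/s^2; h in metres, t in seconds.
FLIGHT_TIME_CONSTANT = 1.22625


def height_from_flight_time(flight_time_s):
    """Jump height in metres from flight time in seconds."""
    return FLIGHT_TIME_CONSTANT * flight_time_s ** 2


def flight_time_from_height(height_m):
    """Flight time in seconds implied by a jump height in metres."""
    return math.sqrt(height_m / FLIGHT_TIME_CONSTANT)


def frames_in_flight(height_m, fps):
    """Number of video frames spanned by the flight phase."""
    return fps * flight_time_from_height(height_m)


def one_frame_height_error(height_m, fps):
    """Height overestimate (m) if flight time is marked one frame too long."""
    t = flight_time_from_height(height_m)
    return height_from_flight_time(t + 1.0 / fps) - height_m


t = flight_time_from_height(0.30)          # roughly half a second
frames = frames_in_flight(0.30, 120)       # a little under 60 frames
error_cm = 100.0 * one_frame_height_error(0.30, 120)  # about 1 cm
```

Under these assumptions, a single mis-marked frame at 120 Hz shifts a 30 cm jump by about 1 cm, which is why a standardized frame-selection procedure matters more than the raw frame rate alone.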
Measurement errors related to instruments and raters in vertical jump assessments by smartphone applications can cause clinically crucial changes in performance to be missed 1,2,8. However, while all of the studies 3,[12][13][14]19,[45][46][47]49,[53][54][55] that evaluated the validity and reliability of My Jump included designs in which vertical jump performance was measured with two devices at the same time, no study had a comprehensive design covering the differences in vertical jump height between two scores by the same rater and between the scores of two different raters from the same video recording. Although some studies reported comparisons of successive jumps performed several minutes apart or of two test days 55,56, possible errors in these designs can be attributed to the participants. For an application such as My Jump, in which data can be collected simultaneously with the criterion device, it is more critical to focus on errors between raters, between devices, and between the same participant's scores than on participant-sourced factors. Because the original studies did not report ICCs consistently, presenting a pooled reliability score based on all reported ICCs can be considered a limitation of the present systematic review and meta-analysis. Twenty studies 3,[12][13][14]19,[45][46][47][49][50][51][52][53][54][55][57][58][59][60][61] using force plate, contact mat, and photocell systems to examine the validity and reliability of My Jump reported high (ICC > 0.80) reliability scores. However, one study 14 compared vertical jump heights obtained from Vertec with the heights obtained from My Jump and found the ICC score for absolute agreement to be 0.665. Although some studies show that Vertec offers valid and reliable results 62,63, considering the results of the other studies using more valid criterion methods, it seems highly likely that the inconsistency between the scores from the two methods is related to the linear position transducers method used by Yingling et al. 14. In addition, the fact that the individual studies comprised participants representing a wide range of the population, such as healthy adults, athletes, both men and women, children, and the elderly, strengthens the competence of this smartphone application to produce valid and reliable outputs. Consequently, the present systematic review and meta-analysis showed that My Jump presented high agreement and consistency scores with the force plate, contact mat, and photocell systems as reference methods, demonstrating a pooled nearly perfect reliability score. In addition to its low cost and simplicity, the My Jump smartphone application could be considered a valid and reliable method of assessing vertical jump height in various populations (Supplementary Information).

Conclusions
This is the first investigation using meta-analytical methods to confirm the validity and reliability of the My Jump smartphone application to measure vertical jump heights. In terms of validity, meta-analysis results revealed that there was raw score agreement between My Jump and the criterion measures, based on nonsignificant Hedges' g values, as well as a high consistency of the within-group rankings, based on the pooled correlation result. In terms of reliability, our present meta-analysis showed near-perfect reliability for My Jump, based on the pooled ICC value. Data from this systematic review and meta-analysis suggest that My Jump can be used for assessing and monitoring vertical jump performance, which is a parameter included in global physical fitness test batteries and which provides information about the neuromuscular function and explosive power of the lower body. However, the included studies mostly targeted adults, and only one study focused on children. More research needs to be conducted on this population to ensure the validity and reliability of the My Jump smartphone application.

https://doi.org/10.1038/s41598-023-46935-x

Figure 1. Flow chart of the review process.

Figure 2. Forest plot of differences between My Jump and related criterion measures. Values shown are Hedges' g with 95% confidence intervals. The size of the plotted squares represents the relative weight of the study.

Figure 3. Forest plot for correlations between My Jump and related criterion measures. Values shown are correlation coefficients with 95% confidence intervals. The size of the plotted squares represents the relative weight of the study.

Figure 4. Forest plot for ICCs of My Jump measures. Values shown are ICC with 95% confidence intervals. The size of the plotted squares represents the relative weight of the study. ICC intraclass correlation coefficients.

Table 1. Descriptive information of included studies.

Table 2. Methodological quality assessments of original studies included in meta-analyses. L low quality, M moderate quality, H high quality, I inadequate.
et al. (2018): H M H M H M M M
Chow et al. (2023): H M H M M H M M
Cruvinel-Cabral et al. (2018): H M M M M M M M
Driller et al. (2017): I H H M H M M M
Gallardo-Fuentes et al.

Table 3. Summary statistics related to the heterogeneity and publication bias. ICC intraclass correlation coefficients, Q Cochran Q statistic for homogeneity test, I²: the proportion of total variation caused by heterogeneity rather than within-study sampling error (%), Tau²: the variance in true effect sizes observed in different studies, Egger: Egger's regression test.

Table 4. Sub-validity analyses for vertical jump types based on Hedges' g values.

Table 5. Sub-validity analyses for the criterion device based on Hedges' g values.

Table 6. Sub-validity analyses for vertical jump types based on correlation values.

Table 7. Sub-validity analyses for criterion device based on correlation values.

Table 8. Sub-reliability analyses for vertical jump types based on correlation values.

Table 10. Sub-reliability analyses for reliability types based on correlation values.